To understand customer purchase behavior across products in different categories, the retail company “ABC Private Limited” in the United Kingdom shared a purchase summary of its customers for selected high-volume products from last month. The data contain the following variables.
Rows: 550,068
Columns: 12
$ User_ID <dbl> 1000001, 1000001, 1000001, 1000001, 1000002…
$ Product_ID <chr> "P00069042", "P00248942", "P00087842", "P00…
$ Gender <chr> "F", "F", "F", "F", "M", "M", "M", "M", "M"…
$ Age <chr> "0-17", "0-17", "0-17", "0-17", "55+", "26-…
$ Occupation <dbl> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, 20, 20…
$ City_Category <chr> "A", "A", "A", "A", "C", "A", "B", "B", "B"…
$ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2", "2", "2…
$ Marital_Status <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0…
$ Product_Category_1 <dbl> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1,…
$ Product_Category_2 <dbl> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, 11, NA…
$ Product_Category_3 <dbl> NA, 14, NA, NA, NA, NA, 17, NA, NA, NA, NA,…
$ Purchase <dbl> 8370, 15200, 1422, 1057, 7969, 15227, 19215…
---
title: "Basic Graphical Displays"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: cerulean
      navbar-bg: "lightblue"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(DT)
library(tidyverse)
library(plotly)
Friday<-read_csv("Black_Friday.csv")
```
Brief Overview 1
===
Column {data-width=450}
-----------------------------------------------------------------------
In this session, we will use the Black Friday data available on [Kaggle](https://www.kaggle.com/datasets/pranavuikey/black-friday-sales-eda) to study how to make the following graphical displays.
```{r}
```
Column {.tabset data-width=550}
-----------------------------------------------------------------------
### Graphical Displays
- Categorical Data
    - Bar Chart
    - Pie Chart
- Quantitative Data
    - Histogram
    - Boxplot
    - Scatterplot
    - Line
### Common Arguments
Here is a list of common arguments:
- col: a vector of colors
- main: title for the plot
- xlim or ylim: limits for the x or y axis
- xlab or ylab: a label for the x or y axis
- font: font used for text (1 = plain, 2 = bold, 3 = italic, 4 = bold italic)
- font.axis: font used for axis tick labels
- cex.axis: font size for axis tick labels
- font.lab: font used for x and y labels
- cex.lab: font size for x and y labels
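
The arguments above can be combined in a single base R call. The following is a minimal sketch, assuming the built-in `cars` data set as a stand-in for the Black Friday data; the helper vector `font_code` is a hypothetical name introduced only to make the font numbers readable.

```r
# Font codes accepted by font, font.axis, font.lab, and font.main
font_code <- c(plain = 1, bold = 2, italic = 3, bold_italic = 4)

plot(
  cars$speed, cars$dist,                 # built-in data, for illustration only
  col = "steelblue", pch = 19,           # col: point color
  main = "Stopping Distance vs. Speed",  # main: plot title
  xlab = "Speed (mph)",                  # xlab: x-axis label
  ylab = "Stopping distance (ft)",       # ylab: y-axis label
  xlim = c(0, 30), ylim = c(0, 130),     # xlim / ylim: axis limits
  font.lab = font_code[["italic"]],      # italic axis labels
  font.axis = font_code[["plain"]],      # plain tick labels
  cex.lab = 1.2,                         # axis labels 20% larger than default
  cex.axis = 0.9                         # tick labels slightly smaller
)
```

The `cex.*` values are multipliers relative to the default text size, so 1 leaves the size unchanged.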
Brief Overview 2 {data-orientation=rows}
===
Row {data-height=100}
---
In this session, we will use the Black Friday data available on [Kaggle](https://www.kaggle.com/datasets/pranavuikey/black-friday-sales-eda) to study how to make the following graphical displays.
Row {data-height=900}
---
### Graphical Displays
- Categorical Data
    - Bar Chart
    - Pie Chart
- Quantitative Data
    - Histogram
    - Boxplot
    - Scatterplot
    - Line
### Common Arguments
Here is a list of common arguments:
- col: a vector of colors
- main: title for the plot
- xlim or ylim: limits for the x or y axis
- xlab or ylab: a label for the x or y axis
- font: font used for text (1 = plain, 2 = bold, 3 = italic, 4 = bold italic)
- font.axis: font used for axis tick labels
- cex.axis: font size for axis tick labels
- font.lab: font used for x and y labels
- cex.lab: font size for x and y labels
Data
===
Column {data-width=550}
---
### <b><font size = 4><span Style = "color:blue">First 500 Observations</span></font></b>
```{r show_table}
datatable(Friday[1:500,], rownames=FALSE, colnames=c("User ID", "Product ID", "Gender", "Age", "Occupation", "City Category", "Stay In Current City Years", "Marital Status", "Product Category 1", "Product Category 2", "Product Category 3", "Purchase"), options=list(pageLength=20))
```
Column {data-width=450}
---
### <font size = 4><span Style="color:red">Description</span></font>
To understand customer purchase behavior across products in different categories, the retail company "ABC Private Limited" in the United Kingdom shared a purchase summary of its customers for selected high-volume products from last month. The data contain the following variables.
- User_ID: User ID
- Product_ID: Product ID
- Gender: Sex of User
- Age: Age in bins
- Occupation: Occupation (Masked)
- City_Category: Category of the City (A,B,C)
- Stay_In_Current_City_Years: Number of years the user has stayed in the current city
- Marital_Status: Marital Status
- Product_Category_1: Product Category (Masked)
- Product_Category_2: Second category the product may also belong to (Masked)
- Product_Category_3: Third category the product may also belong to (Masked)
- Purchase: Purchase Amount
```{r}
glimpse(Friday)
```
Bar Chart {data-orientation=rows}
===
Row {data-height=350}
---
###
A bar chart is a graphical display well suited to a general audience. Here, we study the age-group distribution of the company's customers who made purchases on Black Friday.
**Usage:** barplot(height, ...)
A bar chart can be horizontal or vertical. The argument <span Style="color:orange">col</span> assigns a color to the bars, and the argument <span Style="color:orange">main</span> sets the title of the figure. Colors can also be given as RGB color codes.
**Note:** The margins of a figure can be set with the mar argument of the <span Style="color:blue">par()</span> function. The order of the setting is <span Style="color:orange">c(bottom, left, top, right)</span>.
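
As a small illustration of RGB colors and the margin order, here is a hedged sketch. The counts in `age_counts` are made up for the example (in the dashboard they would come from `table(Friday$Age)`), and `bar_col` and `old_par` are names introduced here only for illustration.

```r
# Made-up counts for illustration; not taken from the Black Friday data
age_counts <- c("18-25" = 30, "26-35" = 55, "36-45" = 40)

# rgb() builds a color from red, green, and blue components (0-255 here)
bar_col <- rgb(105, 179, 162, maxColorValue = 255)  # equivalent to "#69B3A2"

# mar = c(bottom, left, top, right), measured in lines of text
old_par <- par(mar = c(5, 7, 4, 2))
barplot(age_counts, horiz = TRUE, las = 1, col = bar_col,
        main = "Horizontal Bar Chart", xlab = "Number of Purchases")
par(old_par)  # restore the previous margins
```

Saving the value returned by `par()` and restoring it afterwards keeps the margin change from leaking into later plots.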
### Analysis
The shape of this distribution is slightly <span Style="color:red">skewed right</span>. The Age Group "26-35" has the largest number of observations, while the Age Group "0-17" has the smallest.
Row {data-height=650}
---
### **Vertical Bar Chart**
```{r bar1}
par(mgp=c(4,1,0)) #set the margin lines for the axis title, axis labels, and axis line
par(mar=c(5,7,4,2)) #set the margins of the figure: c(bottom, left, top, right)
barplot(table(Friday$Age), col="lightblue", main="Distribution of Purchases by Customer's Age", ylab= "Number of Purchases", xlab="Age Group")
```
### **Horizontal Bar Chart**
```{r bar2}
# par() settings apply only to base graphics, so none are needed for ggplot2
bar1 <- Friday %>%
  ggplot(aes(x=Age)) +
  geom_bar(fill="#69b3a2") +
  coord_flip() +
  labs(title="Distribution of Purchases by Customer's Age", x="Age Group", y="Number of Purchases")
ggplotly(bar1)
```
Pie Chart
===
Column {data-width=500}
---
Similarly, we can use a pie chart to study the distribution of the city category.
**Usage:** pie(x, ...)
**Tip:** Use a color palette to choose colors (search for "color scheme generator").
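
As an alternative to picking hex codes by hand, base R can generate a palette. The sketch below is only illustrative: the shares for A and B are the percentages quoted in the analysis on this page, C is computed as the remainder, and the palette name "Set 2" is just one of the options listed by `hcl.pals()`.

```r
# Shares for A and B are quoted in the analysis; C is derived as the remainder
share <- c(A = 26.9, B = 42.0)
share["C"] <- 100 - sum(share)

# hcl.colors() (base R 3.6 or later) returns n colors from a named palette
pie_cols <- hcl.colors(length(share), palette = "Set 2")

pie(share,
    labels = paste0(names(share), ": ", round(share, 1), "%"),
    col = pie_cols,
    main = "Distribution of City Category")
```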
### Analysis
City Category "B" is the most frequently observed in the data set, at 42%. City Category "A" is the least frequently observed, at 26.9%.
Column {data-width=500}
---
### Distribution of City Category
```{r pie}
H<- table(Friday$City_Category)
percent<-round(100*H/sum(H),1) #calculate percentages
pie_labels <- paste(percent, "%", sep="") # include %
pie(H, main="Distribution of City Category", labels= pie_labels, col=c("#54d2d2", "#ffcb00","#f8aa4b"))
legend("topright", c("A","B","C"), cex=0.8, fill=c("#54d2d2","#ffcb00","#f8aa4b"))
```
Histogram
===
Column {data-width=500}
---
###
A histogram is used when we want to study the distribution of a quantitative variable. Here we study the distribution of customer purchase amount.
**Usage:** hist(x, ...)
```{r histogram}
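# bins=30 is the ggplot2 default; stating it avoids the stat_bin() message
Friday %>% ggplot(aes(x=Purchase))+geom_histogram(bins=30, fill="blue")+labs(title="Distribution of Customer Purchase Amount", x="Purchase Amount (British Pounds)", y="Number of Purchases")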
```
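
The chunk above uses ggplot2, while the usage line refers to the base R function `hist()`. A minimal base R sketch follows, assuming simulated right-skewed values in place of `Friday$Purchase`; the vector `amount` and its gamma parameters are invented for illustration.

```r
set.seed(1)
# Simulated, right-skewed amounts; a stand-in for Friday$Purchase
amount <- rgamma(1000, shape = 2, scale = 4000)

# breaks is a suggestion for the number of bins; hist() may adjust it
h <- hist(amount, breaks = 30, col = "blue", border = "white",
          main = "Distribution of Simulated Purchase Amounts",
          xlab = "Amount")

# hist() invisibly returns the bin edges and counts it drew
head(h$breaks)
sum(h$counts)  # equals the number of observations
```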
Column {data-width=500}
---
### Analysis
The center of the data appears to be around 7,500 British Pounds. The distribution is <span Style="color:red">skewed right</span>: a long upper tail of large purchase amounts, some of them candidate outliers, stretches it to the right.
Boxplot
===
Column {.tabset data-width=550}
---
In general, a boxplot is used when we want to compare the distribution of a quantitative variable across several groups. In the following, we first look at the overall distribution of customer purchase amount and then compare it across groups defined by sex and marital status.
### Boxplot 1
```{r boxplot1}
boxplot(Friday$Purchase, xlab="Purchase Amount", ylab="British Pounds")
```
### Boxplot 2
```{r}
boxplot(Purchase ~ Gender + Marital_Status, data=Friday, main="Distribution of Purchase by Sex and Marital Status", xlab="Sex and Marital Status", ylab="Purchase", cex.lab=0.75, cex.axis=0.5, names=c("Female & Single", "Male & Single", "Female & Married", "Male & Married"))
```
Column {data-width=450}
---
### Analysis of Boxplot 1
The median purchase amount is approximately 7,500 British Pounds. This distribution is also <span Style="color:red">skewed right</span>: the median lies closer to the lower quartile than to the upper quartile of the box. There are also several outliers above the upper fence of the boxplot, which contribute to the right skew.
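
The quantities mentioned here (median, quartiles, upper fence, outliers) can be computed directly. The sketch below uses `boxplot.stats()` on the built-in `mtcars$hp` column purely as a stand-in for `Friday$Purchase`; the variable names are introduced only for this example.

```r
x <- mtcars$hp                      # built-in data, illustration only
s <- boxplot.stats(x)               # uses coef = 1.5 by default

hinges   <- s$stats[c(2, 4)]        # lower and upper hinge (close to Q1 and Q3)
median_x <- s$stats[3]
spread   <- diff(hinges)            # hinge spread, close to the IQR
upper_fence <- hinges[2] + 1.5 * spread

# Right skew: the median sits closer to the lower hinge than to the upper one
c(below = median_x - hinges[1], above = hinges[2] - median_x)

s$out                               # observations beyond the fences
```

Note that `boxplot()` uses hinges rather than `quantile()` quartiles, so the two can differ slightly on small samples.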
### Analysis of Boxplot 2
The Male & Single and Male & Married groups appear to have approximately the same median, and the two Female groups are likewise similar to each other. The two Male groups have higher medians than the Female groups. The Female groups, however, show more outliers above the upper fence than the Male groups.
Scatterplot
===
Column {data-width=500}
---
###
When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this data set doesn't have another quantitative variable, we will use the built-in <span class="orange">mtcars</span> data set in R to study the relationship between miles per gallon and vehicle weight.
```{r scatterplot}
plot(mpg ~ wt, data=mtcars, xlab="Weight (1000 lbs)", ylab="Miles per Gallon", pch=19, col="blue")
```
Column {data-width=500}
---
### Analysis
This scatterplot depicts a <span Style="color:blue">negative, linear, moderately strong relationship</span> between weight and miles per gallon. There don't appear to be any outliers in this plot.
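
One way to back up a visual reading like this is to compute the correlation coefficient and overlay a least-squares line. This is a minimal sketch using only base R and the same `mtcars` columns; `r` and `fit` are names chosen for the example.

```r
# Pearson correlation between weight and fuel economy (about -0.87)
r <- cor(mtcars$wt, mtcars$mpg)

# Simple linear regression of mpg on weight
fit <- lm(mpg ~ wt, data = mtcars)

plot(mpg ~ wt, data = mtcars, pch = 19, col = "blue",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
abline(fit, col = "red", lwd = 2)   # add the fitted line to the scatterplot

round(r, 2)
coef(fit)                           # intercept and slope of the fitted line
```

A negative slope and a correlation near -0.87 are consistent with a negative, fairly strong linear relationship.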
Line Plot
===
Column {.tabset data-width=350}
---
### Data
Since the Black Friday data are not time series data, a line plot is not appropriate for them. In the following code chunk, we create a data frame of the forecasted high temperatures from July 13 to July 22, 2022 ([The Weather Channel](https://weather.com/)).
```{r data}
Date <- 13:22
Dayton_OH <- c(84,86,91,89,89,91,92,91,91,91)
Houston_TX <- c(100,97,96,94,94,94,93,93,92,91)
Denver_CO <- c(95,85,89,96,97,96,92,91,95,96)
Fargo_ND <- c(86,80,84,87,90,87,83,84,87,89)
df<- data.frame(Date, Dayton_OH, Houston_TX, Denver_CO, Fargo_ND)
datatable(df, rownames=FALSE, colnames=c("Date","Dayton, OH", "Houston, TX", "Denver, CO", "Fargo, ND"))
```
### Analysis
From this plot, we can see that <span Style="color:red">Houston, TX had the highest forecasted temperature of the period, but its temperature decreased steadily as time progressed</span>. The opposite happens for <span Style="color:blue">Dayton, OH, where the temperature starts lower and generally increases over the period</span>. <span Style="color:green">Fargo, ND has the lowest average temperature, with the lowest temperature on every day except the 13th and 17th</span>. <span Style="color:purple">Denver, CO reaches its peak temperature in the middle of the period</span>.
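
Statements about averages and daily minimums can be checked against the data rather than read off the plot. The sketch below rebuilds the same temperatures in a data frame named `temps` (a name introduced here so it does not clash with `df`) and summarizes them with base R.

```r
temps <- data.frame(
  Date       = 13:22,
  Dayton_OH  = c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91),
  Houston_TX = c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91),
  Denver_CO  = c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96),
  Fargo_ND   = c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
)
cities <- names(temps)[-1]

# Average forecasted high per city
city_means <- colMeans(temps[, cities])
names(which.min(city_means))        # city with the lowest average

# City with the lowest forecasted high on each date
coolest <- cities[apply(temps[, cities], 1, which.min)]

# Dates on which Fargo, ND is not the coolest city
temps$Date[coolest != "Fargo_ND"]
```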
Column {data-width=650}
---
### Line Chart
```{r line1}
plot(Date, Dayton_OH, type="o", col="blue", xlab="Date in July", ylab="Highest Temperature", ylim=c(80,100))
lines(Date, Houston_TX, type="o", col="red")
lines(Date, Denver_CO, type="o", col="purple")
lines(Date, Fargo_ND, type="o", col="darkgreen")
# Add a legend
legend("topright", # position of the legend
       legend=c("Dayton, OH", "Houston, TX", "Denver, CO", "Fargo, ND"), # labels
       col=c("blue", "red", "purple", "darkgreen"), # colors
       lty=1, # line types
       pch=1) # point types
```